Abstract. Membership diversity is a characteristic aspect of social
networks in which a person may belong to more than one social group. For
this reason, discovering overlapping structures is necessary for realistic
social analysis. In this paper, we present a fast algorithm, called SLPA,
for overlapping community detection in large-scale networks. SLPA spreads
labels according to dynamic interaction rules. It can be applied to both
unipartite and bipartite networks. It is also able to uncover overlapping
nested hierarchy. The time complexity of SLPA scales linearly with the
number of edges in the network. Experiments in both synthetic and
real-world networks show that SLPA achieves excellent performance in
identifying both node and community level overlapping structures.



1 Introduction 

Community or modular structure is considered to be a significant property of
real-world social networks. Thus, numerous techniques have been developed for
effective community detection. However, most of the work has been done on
disjoint community detection. It has been well understood that people in a real
social network are naturally characterized by multiple community memberships.
For example, a person usually has connections to several social groups like family,
friends and colleagues; a researcher may be active in several areas; on the Internet,
a person can simultaneously subscribe to an arbitrary number of groups.

For this reason, overlapping community detection algorithms have been
investigated. These algorithms aim to discover a cover [7], defined as a set of
clusters in which each node belongs to at least one cluster. In this paper, we
propose an efficient algorithm for detecting both individual overlapping nodes
and overlapping communities using the underlying network structure alone.

2 Related Work 

We review the state of the art and categorize existing algorithms into five classes 
that reflect how communities are identified. 



^ Available at https://sites.google.com/site/communitydetectionslpa/

Clique Percolation: CPM [11] is based on the assumption that a community
consists of fully connected subgraphs and detects overlapping communities
by searching for adjacent cliques. CPMw [4] extends CPM for weighted networks
by introducing a subgraph intensity threshold.

Local Expansion: The iterative scan algorithm (IS) [2], [6] expands small
cluster cores by adding or removing nodes until a local density function cannot
be improved. The quality of seeds dictates the quality of discovered communities.
LFM [7] expands a community from a random node. The size and quality of the
detected communities depend significantly on the resolution parameter of the
fitness function. EAGLE [14] and GCE [9] start with all maximal cliques in the
network as initial communities. EAGLE uses the agglomerative framework to
produce a dendrogram in $O(n^2 s)$ time, where $n$ is the number of nodes and $s$
is the maximal number of join operations. In GCE, communities that are similar
within a distance $\epsilon$ are removed. The greedy expansion takes $O(mh)$ time, where
$m$ is the number of edges and $h$ is the number of cliques.

Fuzzy Clustering: Zhang [16] used the spectral method to embed the graph
into a low-dimensional Euclidean space. Nodes are then clustered by the fuzzy
c-means algorithm. Psorakis et al. [12] proposed a model based on Bayesian
non-negative matrix factorization (NMF). These algorithms need to determine the
number of communities $K$, and the use of matrix multiplication makes them
inefficient. For NMF, the complexity is $O(Kn^2)$.

Link Partitioning: Partitioning links instead of nodes to discover
communities has been explored, where the node partition of a link graph leads to an
edge partition of the original graph. In [1], single-linkage hierarchical clustering
is used to build a link dendrogram. The time complexity is $O(n k_{max}^2)$, where
$k_{max}$ is the highest degree of the $n$ nodes.

Dynamical Algorithms: Label propagation algorithms such as [13], [5], [15] use
labels to uncover communities. In COPRA [5], each node updates its belonging
coefficients by averaging the coefficients from all its neighbors in a synchronous
fashion. The time complexity is $O(vm \log(vm/n))$ per iteration, where the parameter
$v$ controls the maximum number of communities with which a node can associate,
and $m$ and $n$ are the numbers of edges and nodes, respectively.

3 SLPA: Speaker-listener Label Propagation Algorithm 

Our algorithm is an extension of the Label Propagation Algorithm (LPA) [13].
In LPA, each node holds only a single label and iteratively updates it to its
neighborhood majority label. Disjoint communities are discovered when the
algorithm converges. Like [5], our algorithm accounts for overlap by allowing each
node to possess multiple labels, but it uses different dynamics with more general
features.

Fig. 1: The execution times of SLPA in synthetic networks with n = 5000
and average degree k varying from 10 to 80.

SLPA mimics human pairwise communication behavior. In each communication
step, one node is a speaker (information provider), and the other is a listener
(information consumer). Unlike other algorithms, each node has a memory of
the labels received in the past and takes its content into account to make the
current decision. This allows SLPA to avoid producing a number of small size
communities as opposed to other algorithms. In a nutshell, SLPA consists of the
following three stages:



Algorithm 1: SLPA(T, r)

T: the user-defined maximum number of iterations
r: post-processing threshold

1) First, the memory of each node is initialized with a unique label.

2) Then, the following steps are repeated until the maximum iteration T is reached:

a. One node is selected as a listener.

b. Each neighbor of the selected node randomly selects a label with probability
proportional to the occurrence frequency of this label in its memory and sends
the selected label to the listener.

c. The listener adds the most popular label received to its memory.

3) Finally, the post-processing based on the labels in the memories and the threshold
r is applied to output the communities.
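Stages 1 and 2 of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function name `slpa_memories`, the adjacency-dictionary input, the random visiting order of listeners, and the uniform random tie-breaking are all choices made for this example. It returns the label memories that stage 3 would post-process.

```python
import random
from collections import Counter


def slpa_memories(adj, T=100, seed=0):
    """Run the speaker-listener stage and return each node's label memory.

    adj:  dict mapping node -> list of neighbor nodes (undirected graph).
    T:    maximum number of iterations (one iteration visits every node once).
    seed: seed for the pseudo-random generator (hypothetical parameter).
    """
    rng = random.Random(seed)
    # Stage 1: every memory starts with the node's own unique label.
    memory = {v: Counter({v: 1}) for v in adj}
    nodes = list(adj)
    for _ in range(T):
        # Assumption: listeners are visited in a fresh random order.
        rng.shuffle(nodes)
        for listener in nodes:
            if not adj[listener]:
                continue  # an isolated node hears nothing
            received = Counter()
            # Speaking rule: each neighbor samples one label from its memory
            # with probability proportional to the label's frequency.
            for speaker in adj[listener]:
                labels = list(memory[speaker])
                weights = [memory[speaker][lab] for lab in labels]
                received[rng.choices(labels, weights=weights)[0]] += 1
            # Listening rule: keep the most popular label, ties broken at random.
            top = max(received.values())
            winners = [lab for lab, c in received.items() if c == top]
            memory[listener][rng.choice(winners)] += 1
    return memory
```

Because labels only travel along edges, a node's memory can never contain a label that originated in a different connected component, which gives a quick sanity check for any implementation.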



Note that SLPA starts with each node being in its own community (a total
of n); the algorithm then explores the network and outputs the desired number of
communities in the end. As such, the number of communities is not required as
an input. Due to step c, the size of memory increases by one for each node
at each step. SLPA reduces to LPA when the size of memory is limited to one
and the stop criterion is convergence of all labels. Empirically, SLPA produces
relatively stable outputs, independent of network size or structure, when T is
greater than 20. Although SLPA is non-deterministic due to the random selection
and ties, it performs well on average as shown in later sections.

Post-processing and Community Detection: In SLPA, the detection of
communities is performed when the stored information is post-processed. Given
the memory of a node, SLPA converts it into a probability distribution of labels.
Since labels represent community ids, this distribution naturally defines the
strength of association to communities to which the node belongs. To produce
crisp communities in which the membership of a node to a given community
is binary, i.e., either a node is in a community or not, a simple thresholding
procedure is performed: if the probability of seeing a particular label during the
whole process is less than a given threshold $r \in [0, 0.5]$, this label is deleted. After
thresholding, connected nodes having a particular label are grouped together and
form a community. If a node contains multiple labels, it belongs to more than
one community and is called an overlapping node. A smaller value of r produces
a larger number of communities. However, the effective range is typically narrow
in practice. When r > 0.5, SLPA outputs disjoint communities, because at most
one label per node can then survive the threshold.
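The thresholding and grouping step described above can be sketched as follows. This is an illustrative sketch only: the function name `extract_communities` and the input format (a dictionary of per-node label counters plus an adjacency dictionary) are assumptions, and "connected nodes having a particular label" is interpreted here as the connected components of the subgraph induced by the nodes that retain that label.

```python
from collections import Counter


def extract_communities(memory, adj, r=0.1):
    """Threshold label memories, then group connected nodes sharing a label.

    memory: dict node -> Counter of label occurrences.
    adj:    dict node -> list of neighbor nodes.
    r:      threshold; labels with relative frequency below r are deleted.
    """
    kept = {}
    for v, mem in memory.items():
        total = sum(mem.values())
        kept[v] = {lab for lab, c in mem.items() if c / total >= r}

    communities = []
    for label in {lab for labs in kept.values() for lab in labs}:
        # Nodes that still carry this label after thresholding.
        unseen = {v for v, labs in kept.items() if label in labs}
        # Split them into connected components of the induced subgraph.
        while unseen:
            start = unseen.pop()
            component, stack = {start}, [start]
            while stack:
                u = stack.pop()
                for w in adj[u]:
                    if w in unseen:
                        unseen.remove(w)
                        component.add(w)
                        stack.append(w)
            communities.append(component)
    return communities
```

A node that appears in more than one returned set is an overlapping node in the sense used above.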

Complexity: The initialization of labels requires $O(n)$, where $n$ is the total
number of nodes. The outer loop is controlled by the user-defined maximum
iteration $T$, a small constant that was set to 100 in our experiments.
The inner loop is controlled by $n$. Each operation of the inner loop executes one
speaking rule and one listening rule. The speaking rule requires exactly $O(1)$
operation. The listening rule takes $O(k)$ on average, where $k$ is the average node
degree. In the post-processing, the thresholding operation requires $O(Tn)$
operations since each node has a memory of size $T$. In summary, the time complexity
of the entire algorithm is $O(Tnk)$ or $O(Tm)$, linear in the total number of
edges $m$. The execution times for synthetic networks were averaged for each $k$
over networks with different structures, i.e., different degree and community size
distributions (see Section 4.1 for details). The results shown in Fig. 1 confirm the
linear scaling of the execution times. On a desktop with a 2.8 GHz CPU, SLPA
took about six minutes to run over a two-million-node Amazon co-purchasing network,
which is ten times faster than GCE running over the same network.
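The step from $O(Tnk)$ to $O(Tm)$ can be made explicit with the degree-sum identity. The following is a short worked sketch; it assumes $n = O(m)$ (i.e., the average degree is bounded below by a constant), which is not stated in the text but is needed for the lower-order terms to be absorbed.

```latex
\underbrace{O(n)}_{\text{initialization}}
\;+\; \underbrace{T \sum_{v} O\bigl(\deg(v)\bigr)}_{\text{speaking and listening}}
\;+\; \underbrace{O(Tn)}_{\text{thresholding}}
\;=\; O(n) + O(Tnk) + O(Tn)
\;=\; O(Tm),
\qquad \text{since } \sum_{v} \deg(v) = nk = 2m .
```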



4 Tests in Synthetic Networks 



4.1 Methodology 



To study the behavior of SLPA, we conducted extensive experiments in both
synthetic and real-world networks. For synthetic random networks, we adopted
the widely used LFR benchmark [8], which allows heterogeneous distributions
of node degrees and community sizes.

^ sites.google.com/site/andrealancichinetti/files



Table 1: Algorithms in the tests.

Algorithm             Complexity            Imp
CFinder [11], 2005                          C++
LFM [7], 2009         $O(n^2 s)$            C++
EAGLE [14], 2009                            C++
CIS [6], 2009                               C++
GCE [9], 2010         $O(mh)$               C++
COPRA [5], 2010       $O(vm \log(vm/n))$    Java
NMF [12], 2010        $O(Kn^2)$             Matlab
Link [1], 2010                              C++
SLPA, 2011            $O(Tm)$               C++

Table 2: The ranking of algorithms.

Rank   $RS_{Omega}$   $RS_{NMI}$   $RS_{F}$
1      SLPA           SLPA         SLPA
2      COPRA          GCE          CFinder
3      GCE            NMF          COPRA
4      CIS            CIS          Link
5      NMF            LFM          LFM
6      LFM            COPRA        CIS
7      CFinder        CFinder      GCE
8      Link           EAGLE        EAGLE
9      EAGLE          Link         NMF



We used networks with size n = 5000. The average degree is kept at k = 10.
The degree of overlapping is determined by two parameters. $O_n$ defines the
number of overlapping nodes and is set to 10% of all nodes. $O_m$ defines the
number of communities to which each overlapping node belongs and varies from
2 to 8, indicating the diversity of overlap. By increasing the value of $O_m$, we
create harder detection tasks. Other parameters are as follows: node degrees and
community sizes are governed by power laws with exponents 2 and 1, respectively; the
maximum degree is 50; the community size varies from 20 to 100; the expected
fraction of links of a node connecting it to other communities, called the mixing
parameter $\mu$, is set to 0.3. We generated ten instance networks for each setting.

In Table 1, we compared SLPA with eight other algorithms representing
different categories discussed in Section 2. For algorithms with tunable parameters,
the performance with the optimal parameter is reported. For CFinder, k varies
from 3 to 10; for COPRA, v varies from 1 to 10; for LFM, $\alpha$ is set to 1.0 [7]. For
Link, the threshold varies from 0.1 to 0.9 with an interval of 0.1. For SLPA, the
number of iterations T is set to 100 and r varies from 0.01 to 0.1. The average
performance together with error bars over ten repetitions is reported for SLPA
and COPRA. For NMF, we applied a threshold varying from 0.05 to 0.5 with
an interval of 0.05 to convert it to a crisp clustering.

To summarize the vast amount of comparison results and provide a measure
of relative performance, we propose $RS_M(i)$, the averaged ranking for algorithm
$i$ with respect to measure $M$, as follows:

$$RS_M(i) = \sum_{j=1}^{7} w_j \cdot \mathrm{rank}(i, O_m^j), \tag{1}$$

where $O_m^j$ is the number of memberships in $\{2, 3, \dots, 8\}$, $w_j$ is the weight, and
the function rank returns the ranking of algorithm $i$ for the given $O_m^j$. For simplicity,
we assume equal weights over different values of $O_m$ in this paper. Sorting $RS_M$ in
increasing order gives the final ranking among algorithms.
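As an illustration of Eq. (1), the averaged ranking can be computed as in the sketch below. The function name `averaged_ranking`, the input layout (one list of measure values per algorithm, ordered by $O_m$ setting), the normalization of the equal weights to sum to one, and the handling of ties by input order are assumptions made for this example; the scores in the usage lines are made-up numbers, not results from the paper.

```python
def averaged_ranking(scores, weights=None, higher_is_better=True):
    """Weighted average rank of each algorithm over several settings.

    scores:  dict algorithm -> list of measure values, one per O_m setting.
    weights: one weight per setting; defaults to equal weights summing to 1.
    """
    algorithms = list(scores)
    n_settings = len(next(iter(scores.values())))
    if weights is None:
        weights = [1.0 / n_settings] * n_settings
    rs = {a: 0.0 for a in algorithms}
    for j in range(n_settings):
        # Rank 1 goes to the best algorithm for setting j.
        ordered = sorted(algorithms, key=lambda a: scores[a][j],
                         reverse=higher_is_better)
        for position, a in enumerate(ordered, start=1):
            rs[a] += weights[j] * position
    return rs


# Hypothetical scores for three algorithms over two settings.
example = {"A": [0.9, 0.8], "B": [0.5, 0.85], "C": [0.1, 0.2]}
rs = averaged_ranking(example)
final_order = sorted(rs, key=rs.get)  # increasing RS gives the final ranking
```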

4.2 Identifying Overlapping Communities in LFR 

Fig. 4: The unimodal histogram of the detected community sizes for
SLPA, GCE and NMF.

Fig. 5: The bimodal histogram of the detected community sizes for
COPRA, CIS, LFM, Link, EAGLE and CFinder.

The extended normalized mutual information (NMI) [7] and Omega Index [3]
are used to quantify the quality of communities discovered by an algorithm. NMI
measures the fraction of nodes in agreement in two covers, while Omega is based
on pairs of nodes. Both NMI and Omega yield values between 0 and 1. The
closer this value is to 1, the better the performance is.
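To make the pair-based nature of Omega concrete, the sketch below follows the commonly stated form of the index: the observed fraction of node pairs that are placed together in the same number of communities by both covers, corrected by the agreement expected by chance. The formula is not spelled out in this paper, so this should be read as our understanding of the definition in [3]; the function name `omega_index` and the input format are assumptions.

```python
from collections import Counter
from itertools import combinations


def omega_index(cover_a, cover_b, nodes):
    """Pair-counting agreement between two covers (lists of node sets)."""

    def co_membership(cover):
        # For every node pair, count the communities that contain both nodes.
        counts = Counter()
        for community in cover:
            for pair in combinations(sorted(community), 2):
                counts[pair] += 1
        return counts

    a, b = co_membership(cover_a), co_membership(cover_b)
    pairs = list(combinations(sorted(nodes), 2))
    n_pairs = len(pairs)
    # Observed agreement: pairs sharing the same number of communities.
    observed = sum(a[p] == b[p] for p in pairs) / n_pairs
    # Expected agreement if the two covers were independent.
    hist_a = Counter(a[p] for p in pairs)
    hist_b = Counter(b[p] for p in pairs)
    expected = sum(hist_a[j] * hist_b[j] for j in hist_a) / n_pairs ** 2
    if expected == 1.0:
        return 1.0  # degenerate case: both covers treat all pairs alike
    return (observed - expected) / (1.0 - expected)
```

The quadratic number of pairs makes this direct form suitable only for small examples.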

As shown in Fig. 2 and Fig. 3, some algorithms behave differently under
different measures (the rankings of $RS_{Omega}$ and $RS_{NMI}$ among algorithms in
Table 2 also change). As opposed to NMF and COPRA, which are especially
sensitive to the measure, SLPA is remarkably stable in NMI and Omega.

Comparing the detected and known numbers of communities and distributions
of community sizes (CS) helps to understand the results. We
expect the community size to follow a power law with exponent 1 and to range
from 20 to 100 by design. As shown in Fig. 4, high-ranking (with high NMI)
algorithms such as SLPA, GCE and NMF typically yield a unimodal distribution
with a peak at CS = 20, fitting well with the ground truth distribution.
In contrast, algorithms in Fig. 5 typically produce a bimodal distribution. The
existence of an extra dominant mode for CS ranging from 1 to 5 results in a
significant number of small size communities in CFinder, LFM, COPRA, Link
and CIS. These observations nicely explain the ranking with respect to NMI.

4.3 Identifying Overlapping Nodes in LFR

Identifying nodes overlapping multiple communities is an essential component of
measuring the quality of a detection algorithm. However, node level evaluation
has often been neglected. Here we first look at the number of detected overlapping
nodes (see Fig. 6) and detected memberships (see Fig. 7) relative to the
ground truth $O_n$ and $O_m$, based on the information in Fig. 2. Note that a value
close to 1 indicates closeness to the ground truth, and values over 1 are possible
when an algorithm detects more nodes or memberships than there are known to
exist. As shown, SLPA yields numbers that are close to the ground truth in
both cases.

Note that the number of detected overlapping nodes alone is insufficient to
accurately quantify the detection performance, as it contains both true and false
positives. To provide precise analysis, we consider the identification of
overlapping nodes as a binary classification problem. A node is labeled as
overlapping as long as its number of memberships, known ($O_m$) or detected, is
greater than 1, and labeled as non-overlapping otherwise. Within this framework,
we can use F-score as a measure of detection accuracy defined as

$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \tag{2}$$

where recall is the number of correctly detected overlapping nodes divided by
$O_n$, and precision is the number of correctly detected overlapping nodes divided
by the number of detected overlapping nodes. F-score reaches its best value at 1
and its worst at 0.
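The node-level evaluation of Eq. (2) can be sketched as follows. The function name `overlap_f_score` and the set-based inputs are assumptions for the example; returning 0 when a denominator vanishes is a convention chosen here, not one specified in the text.

```python
def overlap_f_score(true_overlapping, detected_overlapping):
    """Precision, recall and F-score for overlapping-node detection.

    true_overlapping:     nodes known to belong to more than one community.
    detected_overlapping: nodes the algorithm assigns to more than one community.
    """
    truth = set(true_overlapping)
    detected = set(detected_overlapping)
    correct = len(truth & detected)
    # Convention for this sketch: an empty denominator yields 0.
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical example: 4 true overlapping nodes, 3 detected, 2 correct.
p, r, f = overlap_f_score({1, 2, 3, 4}, {3, 4, 5})
```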

As shown in Fig. 8, SLPA achieves the best score on this metric. This score
has a positive correlation with $O_m$, while the scores of other algorithms are negatively
correlated with it. SLPA correctly uncovers a reasonable fraction of overlapping
nodes even when those nodes belong to many groups (as demonstrated by the
high precision and recall in Fig. 9 and Fig. 10). Other algorithms that fail to
have a good balance between precision and recall result in low F-score, especially
EAGLE and Link. The high precision of EAGLE (also CFinder and GCE for
$O_m = 2$) shows that the clique-like assumption of communities may help to identify
overlapping nodes. However, these algorithms under-detect the number of such nodes.

With the F-score ranking, GCE and NMF no longer rank in the top three
algorithms, while SLPA stays there. Taking both community level performance
(NMI and Omega) and node level performance (F-score) into account, we
conclude that SLPA performs well in the LFR benchmarks.

5 Tests in Real-world Social Networks 

We applied SLPA to a wide range of well-known social networks, as listed in
Table 3. The high school friendship networks, which were analyzed in a project
funded by the National Institute of Child Health and Human Development, are
social networks in high schools self-reported by students together with their
grades, races and sexes. We used these additional attributes for verification.

^ www-personal.umich.edu/~mejn/netdata/ and snap.stanford.edu/data/

Fig. 6: The number of detected overlapping nodes relative to the ground
truth.

Fig. 7: The number of memberships of detected overlapping nodes relative
to the ground truth.

5.1 Identifying Overlapping Communities in Social Networks 

To quantify the performance, we used the overlapping modularity, $Q_{ov}$ (with
values between 0 and 1), proposed by Nicosia [10], which is an extension of
Newman's modularity. A high value indicates a significant overlapping community
structure relative to the null model. We removed CFinder, EAGLE and NMF
from the test because of either their memory or their computation inefficiency
on large networks. As an additional reference, we added the disjoint detection
results with the Infomap algorithm.

As shown in Fig. 11, in general, SLPA achieves the highest average $Q_{ov}$,
followed by LFM and COPRA, even though the performance of SLPA has larger
fluctuation than that in synthetic networks. Compared with COPRA, SLPA is
more stable, as evidenced by the smaller deviation of its $Q_{ov}$ score. In contrast,
COPRA does not work well on highly sparse networks such as P2P, for which
COPRA finds merely one single giant community. COPRA also fails on the Epinions
network because it claims too many overlapping nodes in view of the consensus of
other algorithms, as seen in the bottom of Fig. 12. Such over-detection also
applies to CIS and Link, resulting in low $Q_{ov}$ scores for these two algorithms.
The results in Fig. 12 (based on the clustering with the best $Q_{ov}$) show a common
feature in the tested real-world networks, which is relatively little agreement
between results of different algorithms, i.e., the relatively small overlap in both
the fraction of overlapping nodes (typically less than 30%) and the number of
communities of which an overlapping node is a member (typically 2 or 3).

Table 3: Social networks in the tests.

Network          n        k       Network            n         k
karate (KR)      34       4.5     Email (EM)         33696     10.7
football (FB)    115      10.6    P2P                62561     2.4
lesmis (LS)      77       6.6     Epinions (EP)      75877     10.6
dolphins (DP)    62       5.1     Amazon (AM)        262111    6.8
CA-GrQc (CA)     4730     5.6     HighSchool (HS1)   69        6.3
PGP              10680    4.5     HighSchool (HS2)   612       8.0

Fig. 11: Overlapping modularity $Q_{ov}$ (x-axis: KR, FB, LS, HS1, DP, CA, PGP,
EM, EP, P2P, AM).

Fig. 12: The number of detected memberships (top) and the fraction
of detected overlapping nodes (bottom) (x-axis: KR, FB, LS, HS1, DP, CA, PGP,
EM, EP, P2P, AM).

As is known, a high modularity might not necessarily result in a true
partitioning as it does in disjoint community detection. We used the high school
network (HS1) with known attributes to verify the output of SLPA. As shown
in Fig. 13, there is a good agreement between the found and known partitions
in terms of students' grades. In SLPA, the grade 9 community is further divided
into two subgroups. The larger group contains only white students, while the
smaller group demonstrates race diversity. These two groups are connected
partially via an overlapping node. It is also clear that overlapping nodes only exist
on the boundaries of communities. A few overlapping nodes are assigned to three
communities, while the others are assigned to two communities (i.e., their $O_m$
is 2).

Fig. 13: High school network (n = 69, k ~ 6.4). Labels are the known grades
ranging from 7 to 12. Colors represent communities discovered by SLPA. The
overlapping nodes are highlighted in red.



5.2 Identifying Overlapping Communities in Bipartite Networks 

Discovering communities in bipartite networks is important because they provide
a natural representation of many social networks. One example is the online
tagging system with both users and tags. Unlike the original LPA algorithm, which
performs poorly on bipartite networks, SLPA works well on this kind of network.
We demonstrate this using two real-world networks. One is a Facebook-like
social network. One type of nodes represents users (abbr. FB-M1), while the other
represents messages (abbr. FB-M2). The second network is the interlocking
directorate. One type of nodes represents affiliations (abbr. IL-M1), while the other
represents individuals (abbr. IL-M2).

We compared SLPA with COPRA in Table 4. One difference between SLPA
and COPRA is that SLPA applies to the entire bipartite network directly, while
COPRA is applied to each type of nodes alternately. $Q_{ov}$ is computed on the
projection of each type of nodes. Again, we allow overlapping between
communities. Although COPRA is slightly better (by 0.03) than SLPA on the second
type of nodes for the interlock network, it is much worse (by 0.11) on the first type.
Moreover, COPRA fails to detect meaningful communities in the Facebook-like
network, while SLPA demonstrates relatively good performance.

5.3 Identifying Overlapping Nested Communities 

* Data are available at http://toreopsahl.com/datasets/

Table 4: The $Q_{ov}$ of SLPA and COPRA for two bipartite networks.

Network   n      SLPA (std)     COPRA (std)
FB-M1     899    0.23 (0.10)    0.02 (0.07)
FB-M2     522    0.36 (0.02)    0.02 (0.07)
IL-M1     239    0.59 (0.02)    0.48 (0.02)
IL-M2     923    0.69 (0.01)    0.72 (0.01)

Fig. 14: The nested structure in the high school network represented as a
treemap. The color represents the best explaining attribute: blue for grade,
green for race, and a third color for sex. Numbers in parentheses are the matching
scores defined in the text. The size of shapes is proportional to the community
size. Due to the page limit, only part of the entire treemap is shown.
Communities shown: C1 (68%), C10 (97%), C11 (86%), C13 (96%), C16 (91%),
C1-25 (100%), C1-40 (83%), C10-35 (100%), C10-36 (71%), C11-38 (100%),
C13-45 (100%), C13-46 (100%), C16-32, C16-32-39, C16-32-39-49 (75%).

In the above experiments, we applied a post-processing to remove subset
communities from the raw output of stages 1 and 2 of SLPA. This may not be necessary
for some applications. Here, we show that rich nested structure can be recovered
in the high school network (HS2) with n = 612. The hierarchy is shown as a
treemap in Fig. 14. To evaluate the degree to which a discovered
community matches the known attributes, we define a matching score as the largest
fraction of matched nodes relative to the community size among three attributes
(i.e., grade, race and sex). The corresponding attribute is said to best explain
the community found by SLPA.
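The matching score defined above can be sketched as follows. The text does not specify how "matched nodes" are counted, so this example assumes that, for each attribute, the matched nodes are those sharing the most common value of that attribute within the community; the function name `matching_score` and the nested-dictionary input are likewise hypothetical.

```python
from collections import Counter


def matching_score(community, attributes):
    """Return (best-explaining attribute, matching score) for one community.

    community:  iterable of nodes.
    attributes: dict attribute name -> dict node -> attribute value.
    """
    members = list(community)
    best_attribute, best_score = None, 0.0
    for name, values in attributes.items():
        # Assumption: matched nodes share the community's most common value.
        counts = Counter(values[v] for v in members)
        fraction = counts.most_common(1)[0][1] / len(members)
        if fraction > best_score:
            best_attribute, best_score = name, fraction
    return best_attribute, best_score


# Hypothetical community of four students.
attrs = {
    "grade": {1: 9, 2: 9, 3: 9, 4: 10},
    "race": {1: "x", 2: "y", 3: "x", 4: "y"},
    "sex": {1: "M", 2: "M", 3: "M", 4: "M"},
}
best = matching_score([1, 2, 3, 4], attrs)
```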

As shown, SLPA discovers a tree with a height of four. Most of the communities
are distributed on the first two levels. The community name shows the full
hierarchy path (connected by a dash '-') leading to this community. For example,
C1 has id 1 and is located on the first level, while C1-25 has id 25 and is the
second-level sub-community of the community with id 1.

Nested structures are found across different attributes. For example, C13 is
best explained by race, while its two sub-communities perfectly account for grade
and sex, respectively. In C1, sub-communities explained by the same attribute
account for different attribute values. For example, both C1-25 and C1-40 are
identified by sex. However, the former contains only male students, while in the
latter female students are the majority. Although the treemap is not capable
of displaying overlaps between communities, the nested structures overlap as
before.

6 Conclusions 

We introduced a dynamic interaction process, SLPA, as a basis for an efficient
and effective unified overlapping community detection algorithm. SLPA allows
us to analyze different kinds of community structures, such as disjoint
communities, individual overlapping nodes, overlapping communities and overlapping
nested hierarchy in both unipartite and bipartite topologies. Its underlying
process can be easily modified to accommodate other types of networks (e.g.,
k-partite graphs). In future work, we plan to apply SLPA to temporal
community detection.

^ Treemap is used for visualization: www.cs.umd.edu/hcil/treema